Frontiers in Bioinformatics
○ Frontiers Media SA
Preprints posted in the last 30 days, ranked by how well they match Frontiers in Bioinformatics's content profile, based on 45 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.
Fletcher, W. L.; Sinha, S.
The practices of identifying biomarkers and developing prognostic models using genomic data have become increasingly prevalent. Such data often feature characteristics that make these practices difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performance in these tasks on diverse right-censored time-to-event data (i.e., survival time data) is much needed. We have compiled many existing methods, including some machine learning methods, several of which have performed well in previous benchmarks, and compared them primarily on variable selection capability and secondarily on survival time prediction, using many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we have also performed multiple analyses on a publicly available and widely used cancer cohort from The Cancer Genome Atlas using these methods. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the Adaptive LASSO performed well in all metrics, and the LASSO and elastic net excelled when evaluating concordance index and F1-score. The Benjamini-Hochberg and q-value procedures showed volatile performance in controlling the false discovery rate. The performance of some methods was greatly affected by differences in the data characteristics. With our extensive numerical study, we have identified the best performing methods for a plethora of data characteristics using informative metrics. This will help cancer researchers in choosing the best approach for their needs when working with genomic data.
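As a point of reference for the false discovery rate procedures named in this abstract, the following is a minimal sketch of the standard Benjamini-Hochberg step-up rule. It is not the authors' code; the function name `benjamini_hochberg` and the default `alpha=0.05` are illustrative assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Standard step-up rule: find the largest rank k (1-based, p-values sorted
    ascending) with p_(k) <= (k / m) * alpha, then reject the k smallest.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values in ascending order of p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values, e.g. from per-gene univariate tests.
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(pvals))
```

Because the rule is step-up, a small p-value that misses its own threshold can still be rejected when a larger one passes, which is one reason results can vary noticeably with the p-value distribution.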
Richardson, E.; Aarts, Y. J. M.; Altin, J. A.; Baakman, C. A. B.; Bradley, P.; Chen, B.; Clifford, J.; Dhar, M.; Diepenbroek, D.; Fast, E.; Gowthaman, R.; He, J.; Karnaukhov, V.; Marzella, D. F.; Meysman, P.; Nielsen, M.; Nilsson, J. B.; Deleuran, S. N.; Parizi, F. M.; Pelissier, A.; Pierce, B. G.; Rodriguez Martinez, M.; Roran A R, D.; Saravanakumar, S.; Shao, Y.; Smit, N.; Van Houcke, M.; Visani, G. M.; Wan, Y.-T. R.; Wang, X.; Woods, L.; Wuyts, S.; Xiao, C.; Xue, L. C.; IMMREP25 Participant Consortium; Barton, J.; Noakes, M.; May, D. H.; Peters, B.
T cell receptors (TCRs) can bind to peptides presented by MHC molecules (pMHC) as a first step to trigger a T cell response. Reliable approaches to predict TCR:pMHC binding would have broad applications in clinical diagnostics, therapeutics, and the fundamental understanding of molecular interactions. IMMREP is a community-organized series of prediction contests that asks participants to predict TCR:pMHC binding on unpublished datasets. Previous iterations in 2022 and 2023 showed that multiple approaches can predict TCR:pMHC binding with significant accuracy (median AUC_0.1 ≥ 0.7) for peptides where experimental data is available ("seen" peptides). In contrast, models did not outperform random guessing for peptides that have no such data available ("unseen" peptides). Here we report on the results of IMMREP25, which focused solely on unseen peptides in order to evaluate the cutting edge of the field. We received 126 named submissions predicting the specificity of 1,000 TCRs against twenty unseen peptides restricted by one of two MHC molecules (HLA-A*02:01 and HLA-B*40:01). The best performing methods showed a macro-AUC_0.1 of 0.60, significantly better than random, demonstrating significant advances in the field. The top performing methods incorporated structural modeling into their approach, indicating that, especially for unseen peptides, a structural understanding aids in the prediction of TCR:pMHC interactions. The results from this benchmark highlight the significant challenges remaining for TCR:pMHC predictions and will inform future method development.
Vliora, A.; Tiberti, M.; Papaleo, E.
MAVISp (Multi-layered Assessment of VarIants by Structure for proteins) is a structure-based framework for facilitating mechanistic interpretation of missense variants, with protein stability as one of its core analytical layers. When software tools are updated, a key consideration for database curation is whether the new version can be adopted without compromising compatibility with existing entries. This study evaluated the effect of replacing FoldX5 with FoldX5.1 on the results of the MAVISp stability workflow. We compared predicted changes in folding free energy for 539,809 shared variants across 119 proteins. We found high overall agreement with a mean Pearson correlation of 0.933 and a mean Cohen coefficient of 0.814. Most proteins showed strong concordance, whereas only three (NUPR1, TSC1, and TMEM127) showed poor agreement. The number of disagreements was higher at sites with low AlphaFold2 confidence for NUPR1 and TSC1. These outliers did not display systematic inter-version bias, as mean shifts in folding free energies between versions were minimal. Collectively, these findings support adopting FoldX5.1 for future MAVISp data collection. We will include a transition period, during which existing entries retain FoldX5 annotations until their scheduled annual update, while new or updated entries are processed with FoldX5.1. To facilitate this transition, the FoldX software version has been added as a new metadata annotation in the MAVISp database.
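The two agreement statistics reported in this abstract can be illustrated with a short sketch. It assumes that the "Cohen coefficient" refers to Cohen's kappa computed on per-variant class labels (for example stabilizing, neutral, destabilizing), which the abstract does not state explicitly; the function names and the toy values are hypothetical.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two labelings."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both labelings were independent.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelings use a single identical class
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical ddG values (kcal/mol) from two software versions.
    old = [0.2, 1.5, 3.1, -0.4]
    new = [0.3, 1.4, 3.3, -0.5]
    print(pearson(old, new))
    print(cohen_kappa(["D", "D", "N", "N"], ["D", "D", "N", "D"]))
```

Reporting both is useful because correlation measures agreement on the continuous scores, while kappa measures whether the downstream classification changes.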
Brate, J.; Grande, E. G.; Pedersen, B. N.; Frengen, T. G.; Stene-Johansen, K.
Here we evaluated the performance of a previously published tiling PCR primer scheme by Ringlander et al. (2022) for whole-genome amplification of Hepatitis B virus (HBV) in combination with Oxford Nanopore sequencing. The primer set originally developed for Ion Torrent sequencing was adapted by removing platform-specific adapters and tested using clinical serum or plasma samples submitted for routine HBV genotyping and resistance testing. Two multiplexing strategies were compared: a single PCR pool containing all primers and a two-pool strategy with non-overlapping amplicons. Sequencing reads were processed using a Nanopore analysis pipeline, and genome coverage and amplicon performance were compared across samples spanning a wide Ct range and representing HBV genotypes A-E. Across all samples, the median genome coverage was approximately 50%, although recovery varied widely, ranging from complete failure to nearly full genomes. Combining all primers into a single PCR reaction, or separating overlapping amplicons into different reactions, had little overall impact on genome recovery, and no consistent differences between the two pooling strategies were observed. In contrast, amplification efficiency differed markedly between individual amplicons. Amplicons 1-5 generally produced higher sequencing depth, whereas amplicons 6-10 frequently showed low coverage and contributed to incomplete genome recovery. Genome coverage was strongly associated with Ct values, with higher coverage observed in samples with lower Ct values, while coverage was broadly similar across genotypes. These results demonstrate that the Ringlander et al. primer scheme can be adapted for multiplex PCR and Nanopore sequencing of HBV, but uneven amplicon performance limits consistent full-genome recovery and highlights the need for further optimization of HBV tiling PCR designs.
Zhang, X.
Large language model agents are increasingly used for bioinformatics tasks that require external databases, tool use, and long multi-step retrieval workflows. However, practical evaluation of these systems remains limited, especially for prompts whose target set is both large and biologically heterogeneous. Here, I benchmarked three agent systems on the same difficult retrieval task: downloading coccolithophore calcification-related proteins from UniProt across six mechanistically distinct categories, while producing category-separated FASTA files and supporting evidence. The compared systems were Codex app agents extended with Claude Scientific Skills, Biomni Lab online, and DeerFlow 2 with default skills only. Outputs were normalized at the UniProt accession level and compared category by category using overlap analysis, Venn decomposition, and a heuristic relevance assessment of each subset relative to the benchmark prompt. Across the six shared categories, Codex retrieved 2,118 proteins, DeerFlow 6,255, and Biomni 8,752 in a run. Codex showed the best balance between sensitivity and specificity: 92.4% of its proteins fell into subsets labeled high relevance and the remaining 7.6% into medium relevance. DeerFlow was substantially more exhaustive, but 43.8% of its proteins fell into low or low-medium relevance subsets. Biomni produced the largest sets, yet 69.5% of its proteins fell into low or low-medium relevance subsets, mainly due to broad expansion into generic calcium sensors, kinases, transcription factors, and poorly specific domain families. Category-specific analysis showed that Codex was the strongest primary source for inorganic carbon transport, calcium and pH regulation, vesicle trafficking, and signaling, whereas DeerFlow contributed valuable complementary matrix and polysaccharide candidates. 
A second run for each system also separated them strongly by repeatability: Codex had the highest within-system stability (mean category Jaccard 0.982; micro-Jaccard 0.974), DeerFlow was intermediate (0.795; 0.571), and Biomni was least stable (0.412; 0.319). These results suggest that for complex protein-family retrieval tasks, agent quality depends less on raw output volume than on prompt decomposition, taxonomic scoping, exact query generation, provenance-rich export artifacts, and repeated-run stability.
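The repeatability figures above can be made concrete with a small sketch. The abstract does not define "micro-Jaccard"; the version below assumes it is the Jaccard index over pooled (category, accession) pairs, while the mean category Jaccard averages the per-category values. The function names and the toy accessions are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two collections; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def repeatability(run1, run2):
    """Compare two runs given as {category: set of UniProt accessions}.

    Returns (mean per-category Jaccard, pooled "micro" Jaccard).
    """
    categories = sorted(set(run1) | set(run2))
    if not categories:
        return 1.0, 1.0
    per_category = [jaccard(run1.get(c, ()), run2.get(c, ())) for c in categories]
    mean_category = sum(per_category) / len(per_category)
    # Assumption: micro-Jaccard pools all (category, accession) pairs.
    pooled1 = {(c, acc) for c, accs in run1.items() for acc in accs}
    pooled2 = {(c, acc) for c, accs in run2.items() for acc in accs}
    return mean_category, jaccard(pooled1, pooled2)


if __name__ == "__main__":
    first = {"transport": {"P1", "P2"}, "signaling": {"P3"}}
    second = {"transport": {"P1", "P2"}, "signaling": {"P3", "P4", "P5"}}
    print(repeatability(first, second))
```

Under this reading, the micro value is dominated by large categories, which would explain why it can fall well below the category mean when one large category is unstable.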
Lee, H.; Kim, H.
Background: CD276 has been proposed as a candidate gene associated with the biological characteristics of meningioma, but its predictive position and interpretive significance within a transcriptomic classifier have not yet been clearly established. Accordingly, this study aimed to evaluate CD276 stepwise across internal model development, external validation, calibration, decision-analytic assessment, feature stability, and robustness analyses using public transcriptomic cohorts. Methods: The analyses in this study were organized into two interconnected notebooks. In Notebook A, we reconstructed the internal training cohort (GSE183653), evaluated the CD276 single-gene signal, and then developed a transcriptome-wide multigene classifier. We also performed permutation importance, bootstrap confidence interval, label permutation test, repeated cross-validation, CD276 ablation, and internal calibration analyses. In Notebook B, we reproduced the external validation cohort (GSE136661) in a fixed common-gene space, applied train-only recalibration and train-only threshold transfer, and extended the interpretation through decision curve analysis, stability analysis, enrichment analysis, and one-factor-at-a-time robustness analysis. Results: The internal training cohort consisted of 185 samples and 58,830 genes, of which 25 were WHO grade III cases. CD276 expression showed a significant association with WHO grade, but the internal discrimination of the CD276-only baseline was limited (ROC-AUC 0.628, average precision 0.323, balanced accuracy 0.540). In contrast, the initial transcriptome-wide model showed ROC-AUC 0.834 and PR-AUC 0.509, and under 5-fold cross-validation, the canonical full-transcriptome model and the CD276-forced 5,001-feature branch showed mean ROC-AUC/PR-AUC of 0.854/0.564 and 0.855/0.606, respectively, outperforming the CD276-only baseline at 0.644/0.391.
CD276 was not included in the initial 5,000-feature filtered set and ranked 900th among 5,001 features even in the forcibly included 5,001-feature branch. In paired ablation analysis, the performance difference attributable to inclusion of CD276 was effectively close to zero (delta ROC-AUC 0.000062, delta PR-AUC 0.000056). Internal calibration analysis showed an overconfident probability pattern (Brier score 0.10501, intercept -1.421392, slope 0.413241). In external validation, the fixed multigene pipeline achieved ROC-AUC 0.928 and PR-AUC 0.335. Train-only recalibration improved calibration metrics while preserving discrimination, and decision curve analysis showed threshold-dependent but limited external utility. Stability analysis showed overlap between core-stable genes and high-impact genes, but CD276 was not supported as a dominant stable core feature and remained in the target-of-interest tier. In robustness analysis, some perturbations preserved the primary interpretation, whereas others revealed transform sensitivity or an alternative high-performing feature-space solution. Conclusions: CD276 is a gene of interest associated with meningioma grade, but it was difficult to interpret it as a strong standalone predictor or a dominant stable classifier feature. In this study, the main basis of predictive performance lay not in CD276 alone but in a broader multigene transcriptomic structure, and probability output needed to be interpreted conservatively with calibration taken into account. These findings position CD276 not as a direct single-gene classifier but as a biology-motivated target-of-interest that should be interpreted within a broader transcriptomic program.
Maier, J.; Gin, C.; Rabasco, J.; Spencer, W.; Bass, A.; Duerkop, B. A.; Callahan, B.; Kleiner, M.
Background: Transduction is a form of horizontal gene transfer in which bacterial DNA is packaged and transferred by virus-like particles (VLPs). Transductomics is a sequencing-based method used to detect DNA carried by VLPs. During transductomics analysis, reads from a sample's ultra-purified VLPs are mapped to metagenomic contigs assembled from the same sample's whole community. The read mapping produces coverage patterns that require a time-consuming manual inspection and classification process, which makes the method's use unfeasible for datasets with many samples. Results: We developed a novel algorithm, TrIdent (Transduction Identification), that uses pattern-matching to automate the transductomics data analysis and that is available as an R package (https://jlmaier12.github.io/TrIdent/). There is no software equivalent to TrIdent, so we compared TrIdent's classifications of transductomics datasets to classifications made by human classifiers. TrIdent's classifications were generally comparable to the manual classifications on a previously generated, manually classified transductomics dataset. When applied to newly generated transductomics data from the murine microbiota, TrIdent agreed with two independent human classifiers as much as the two independent human classifications agreed with each other. TrIdent classified transductomics datasets in a fraction of the time needed by human classifiers, and the classifications produced by TrIdent are fully reproducible. We used TrIdent to explore three murine gut transductomes and found that bacterial DNA associated with the Oscillospiraceae and Turicibacteraceae families was highly enriched in the DNA packaged by VLPs as compared to the whole community metagenomes. Conclusions: The TrIdent software is a more accessible, more efficient, and more reproducible alternative to the manual inspection of read coverage patterns previously required for transductomics data analysis.
To demonstrate the application of TrIdent, we analyzed transductomics datasets from murine fecal pellets and showed that specific low abundance bacterial families appear to be heavily involved in transduction.
Chandra, S.
Background: Current deep learning models in computational pathology, radiology, and digital pathology produce opaque predictions that lack the explainable artificial intelligence (xAI) capabilities required for clinical adoption. Despite achieving radiologist-level performance in tasks from whole-slide image (WSI) classification to mammographic screening, these models function as black boxes: clinicians cannot trace predictions to specific biological features, verify outputs against established morphological criteria, or integrate AI reasoning into precision oncology workflows and tumor board decision-making. Methods: We present Virtual Spectral Decomposition (VSD), a modality-agnostic, interpretable-by-design framework that decomposes medical images into six biologically interpretable tissue composition channels using sigmoid threshold functions - the same mathematical structure as CT windowing. Unlike post-hoc xAI methods (Grad-CAM, SHAP, LIME) applied to black-box deep learning models, VSD channels have pre-defined biological meanings derived from tissue physics, providing inherent explainability without sacrificing quantitative rigor. For whole-slide image (WSI) analysis in digital pathology, we introduce the dendritic tile selection algorithm, a biologically-inspired hierarchical architecture achieving 70-80% computational reduction while preferentially sampling the tumor immune microenvironment. VSD is validated across three cancer types and imaging modalities: pancreatic ductal adenocarcinoma (PDAC) on CT imaging, lung adenocarcinoma (LUAD) on H&E-stained pathology slides using TCGA data, and breast cancer on screening mammography. Composition entropy of the six-channel vector is computed as a visual Biological Entropy Index (vBEI) - an imaging biomarker quantifying the diversity of active biological defense systems. 
Results: In pancreatic cancer, the fat-to-stroma ratio (a novel CT-derived radiomics biomarker) declines from >5.0 (normal) to <0.5 (advanced PDAC), enabling early detection of desmoplastic invasion before mass formation on standard imaging. In lung cancer, composition entropy from H&E whole-slide images correlates with tumor immune microenvironment markers from RNA-seq (CD3: rho=+0.57, p=0.009; CD8: rho=+0.54, p=0.015; PD-1: rho=+0.54, p=0.013) and predicts overall survival (low entropy immune-desert phenotype: 71% mortality vs 29%, p=0.032; n=20 TCGA-LUAD), providing immune phenotyping for checkpoint immunotherapy patient selection from a $5 H&E slide without molecular assays. In breast cancer, each lesion type produces a characteristic six-channel fingerprint functioning as an interpretable computer-aided diagnosis (CAD) system for quantitative BI-RADS assessment and subtype classification (IDC vs ILC vs DCIS vs IBC). A five-level xAI audit trail provides complete traceability from clinical decision support output to specific biological structures visible on the original images. Conclusion: VSD establishes a unified, interpretable-by-design mathematical framework for explainable tissue composition analysis across imaging modalities and cancer types. Unlike black-box deep learning and post-hoc xAI approaches, VSD provides inherently interpretable, clinically verifiable cancer detection and immune phenotyping from standard clinical imaging at existing costs - without requiring foundation model infrastructure, specialized hardware, or molecular assays. The open-source pipeline (Google Colab, Supplementary Material) enables immediate reproducibility and extension to additional cancer types across the pan-cancer TCGA atlas.
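The two mathematical ingredients named in this abstract, a sigmoid threshold function and the entropy of a six-channel composition vector, can be sketched as follows. The abstract gives no formulas, so this assumes a logistic sigmoid with a hypothetical `center` and `width`, and Shannon entropy (natural logarithm) over channel values normalized to fractions; it is not the authors' pipeline.

```python
import math


def sigmoid_window(value, center, width):
    """Soft threshold in [0, 1]; 0.5 at `center`, steeper for smaller `width`.

    Written in a numerically stable form so large |z| does not overflow.
    """
    z = (value - center) / width
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def composition_entropy(channels):
    """Shannon entropy (nats) of non-negative channel values.

    Assumption: channels are normalized to fractions that sum to 1.
    """
    total = sum(channels)
    if total <= 0:
        raise ValueError("channel values must have a positive sum")
    fractions = [c / total for c in channels]
    return -sum(p * math.log(p) for p in fractions if p > 0)


if __name__ == "__main__":
    # Hypothetical six-channel composition for one image tile.
    tile = [0.30, 0.25, 0.20, 0.15, 0.07, 0.03]
    print(composition_entropy(tile))
    print(sigmoid_window(40.0, center=40.0, width=10.0))
```

Under these assumptions the entropy ranges from 0 (one channel dominates) to ln 6 (all six channels equal), so a low value corresponds to a less diverse composition.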
Xu, Y.; Zhang, X.; Chen, W.; Li, Y.; Lu, L.; Huang, R.; Liao, J.; Li, H.; Zheng, W.
Purpose: Differentially expressed genes (DEGs) between colorectal cancer liver metastasis (CRLM) epithelium and primary colorectal cancer (CRC) epithelium (LMR DEGs) identified based on single-cell RNA sequencing (scRNA-seq) data may become new biomarkers for CRC prognosis. Methods: An scRNA-seq dataset was used to describe the cellular landscape of primary CRC and CRLM and identify LMR DEGs. Prognostic LMR DEGs were identified in the bulk RNA-seq dataset. Based on the prognostic LMR DEGs, multiple machine learning algorithm combinations were compared in terms of their C-index, and the best model was selected for the construction of the LMR score. Results: Among the 2070 LMR DEGs, 426 prognostic LMR DEGs were ultimately obtained. The combination of the random survival forest (RSF) model and ridge regression had the highest C-index and was therefore used to construct a 15-gene scoring system (LMR score). In the external validation set, the 1- and 5-year AUCs of the LMR score were greater than those of the AJCC stage and other scoring systems constructed with a similar dataset. In addition, the LMR score was closely associated with factors that influence CRC outcomes, such as immune infiltration. Conclusion: The LMR score may be a reliable new biomarker for predicting the prognosis of patients with CRC.
WANG, Z.; Arsuaga, J.
Computational bacteriophage host prediction from genomic sequences remains challenging because host range depends on diverse, rapidly evolving genomic determinants--from receptor-binding proteins to anti-defense systems and downstream infection compatibility--and because the signals available to predictors, including sequence homology, CRISPR spacer matches, nucleotide composition, and mobile genetic elements, are sparse, unevenly distributed across taxa, and constrained by incomplete host annotations. Here, we frame host prediction as an unsupervised retrieval problem. We asked whether embeddings from the pretrained genome language model Evo2 captured a reliable host-range signal without training on phage-host labels. We generated whole-genome embeddings for phages and candidate bacterial hosts with the Evo2-7B model, applied normalization, and ranked hosts by cosine similarity. Using the Virus-Host Database, we selected embedding and fusion choices on a Gram-positive validation cohort and then evaluated the approach on a held-out Gram-negative test cohort to minimize data leakage. We found that Evo2 was strongest at retrieving multiple plausible hosts, with the recorded host in the top 10 for 55.4% of phages. However, it did not maximize species-level top-1 accuracy (19.4% vs. 23.2% for the best baseline). At higher taxonomic ranks, Evo2 captured a coarser host-range signal: top-1 accuracy reached 43.4% at the genus level and 51.6% at the family level. Reciprocal rank fusion of Evo2 with BLASTN, VirHostMatcher, and PHIST improved all retrieval metrics. Top-10 retrieval rose to 58.5% and top-1 accuracy to 26.9%. Stratified analyses by phage genome length, host clade, and host mobile genetic element coverage revealed scenario-dependent performance. Evo2 embeddings excelled for intermediate-length phages and when host mobile element content was low, whereas alignment and k-mer methods dominated when local homology was abundant. 
These results suggest that pretrained genome embeddings complement established alignment- and k-mer/composition-based methods and that context-aware hybrid pipelines may help improve phage host prediction. Author summary: Bacteriophages are viruses that prey on bacteria and play central roles in microbial ecosystems, nutrient cycling, and the spread of antibiotic resistance genes. Knowing which bacterium a phage can infect is important for applications such as phage therapy, where viruses are used to treat bacterial infections, but making this prediction from DNA sequence data alone remains difficult. Existing computational tools each exploit different types of genomic evidence, and none works reliably across all settings. We asked whether an artificial intelligence model trained to read raw DNA--without ever being shown which phages infect which hosts--could contribute a new, complementary signal. We found that this approach was particularly effective at narrowing the field to a short list of candidate hosts and at capturing broad evolutionary relationships between phages and bacteria. When we combined it with established sequence-comparison tools, overall prediction improved beyond what any single method achieved alone. By examining when each method succeeded or failed, we identified biological factors that govern prediction difficulty, offering practical guidance for building more robust prediction systems.
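The retrieval setup described in this abstract, ranking hosts by cosine similarity and then combining several rankings with reciprocal rank fusion, can be sketched in a few lines. This is an illustration only: the toy vectors stand in for real embeddings, the function names are hypothetical, and the fusion constant `k=60` is a commonly used default rather than a value reported by the authors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_hosts(phage_vec, host_vecs):
    """Return host names sorted from most to least similar to the phage."""
    return sorted(
        host_vecs,
        key=lambda host: cosine_similarity(phage_vec, host_vecs[host]),
        reverse=True,
    )


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over lists."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


if __name__ == "__main__":
    hosts = {"hostA": [1.0, 0.0], "hostB": [0.6, 0.8], "hostC": [0.0, 1.0]}
    embedding_ranking = rank_hosts([1.0, 0.1], hosts)
    # Hypothetical rankings from two other tools for the same phage.
    other_rankings = [["hostB", "hostA", "hostC"], ["hostB", "hostC", "hostA"]]
    print(reciprocal_rank_fusion([embedding_ranking] + other_rankings))
```

Rank fusion uses only positions, not raw scores, so methods with incomparable score scales (alignment bit scores, k-mer distances, embedding similarities) can be combined without calibration.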
Albuja, D. S.; Maldonado, P. S.; Zambrano, P. E.; Olmos, J. R.; Vera, E. R.
Accurate fungal species identification is critical for microbial ecology, food safety, and plant pathology. However, morphological limitations and genomic complexity hinder this process. Molecular markers such as the ITS region, along with Oxford Nanopore long-read sequencing, offer a robust solution, albeit limited by error rates in homopolymeric regions and a high dependence on advanced computational resources (GPUs) to achieve high accuracy. To address this technological gap, this study benchmarks two bioinformatics workflows on a multiplexed dataset of complex fungal communities: a CPU-based workflow optimized using a Bayesian machine learning engine and a GPU-accelerated workflow incorporating "super high accuracy" (SUP) models and refinement with neural networks. The results establish a scalable framework for evaluating the impact of computational architecture on final taxonomic resolution. GPU processing is shown to maximize data retention and species-level accuracy by correcting systematic errors. Alternatively, implementing automated hyperparameter optimization in CPU environments stabilizes sequence clustering and achieves high taxonomic concordance at the genus level. This conceptual advance validates the feasibility of performing ITS metabarcoding analysis in resource-constrained infrastructures, thus providing the scientific community with a reproducible protocol that balances the need for taxonomic precision with hardware availability.
Zhang, Y.; Chen, Z.; Zheng, C.; Peng, X.; Lu, Y.; Zhang, J.; Sun, P.
Colonic adenocarcinoma (COAD) is a major cause of cancer-related mortality worldwide. Various tumors are linked to metastasis-associated in colon cancer 1 (MACC1). This study aimed to analyze public datasets to examine MACC1 expression, signaling pathways, copy number variations, and associations with immune cell subsets in COAD employing bioinformatics. MACC1 expression was elevated in COAD, especially in Wnt signaling and chromatin modifier pathways. Analysis of somatic copy number alterations in The Cancer Genome Atlas-COAD dataset revealed a link between MACC1 and DNA damage repair. MACC1 also showed a negative correlation with genes involved in immune cell infiltration in patients with COAD, including cluster of differentiation (CD)8+ T cells, activated dendritic cells, CD8 T cells, and cytotoxic cells. Collectively, these findings suggest MACC1 as a potential prognostic biomarker and therapeutic target for COAD.
Haque, N.; Mazed, A.; Ankhi, J. N.; Uddin, M. J.
Accurate classification of SARS-CoV-2 genomic variants is essential for effective genomic surveillance, yet it is challenged by extreme class imbalance, limited representation of rare variants, and distribution shifts in real-world sequencing data. In this study, we employed a hybrid RF-SVM framework designed for robust detection of rare SARS-CoV-2 variants. It integrates a random forest and a polynomial-kernel support vector machine to enhance sensitivity to minority classes while maintaining overall predictive stability. We systematically compared classical machine learning models, deep learning approaches, and hybrid strategies under both standard and distribution-shifted evaluation settings. Our results show that classical models using TF-IDF-based k-mer features outperform deep learning methods on macro-averaged performance metrics. The random forest classifier using TF-IDF features achieved the best overall performance, with a macro-averaged F1-score of 0.8894 and an accuracy of 96.3%. The model also demonstrated strong generalization ability, as evidenced by stable cross-validation performance (CV accuracy = 0.9637). The hybrid RF-SVM model further improves rare variant detection under severe class imbalance. Calibration analysis indicates reliable probability estimates for common variants, although challenges persist for minority classes. Overall, this study highlights the limitations of deep learning in highly imbalanced genomic settings and demonstrates that carefully designed hybrid machine learning approaches provide an effective and interpretable solution for rare SARS-CoV-2 variant detection.
Hughes, N.; Hogenboom, J.; Carter, R.; Norman, L.; Gouthamchand, V.; Lindner, O.; Connearn, E.; Lobo Gomes, A.; Sikora-Koperska, A.; Rosinska, M.; Pogoda, K.; Wiechno, P.; Jagodzinska-Mucha, P.; Lugowska, I.; Hanebaum, S.; Dekker, A.; van der Graaf, W.; Husson, O.; Wee, L.; Feltbower, R.; Stark, D.
Background: Population-based cancer registers (PBCR) are important for monitoring trends in cancer epidemiology, facilitating the implementation of effective cancer services. Adolescents and young adults (AYA) with cancer are a patient group with a unique set of needs. The utility of PBCR in AYA is limited by the lack of AYA-specific data items. STRONG AYA, an international multidisciplinary consortium, is addressing this through federated learning (FL) methodology and novel data visualisation concepts. A Core Outcome Set (COS) has been developed to measure outcomes of importance through clinical data and Patient Reported Outcomes (PROs). We describe how data from the Yorkshire Specialist Register of Cancer in Children and Young People (YSRCCYP), a PBCR in the UK, is being used within STRONG AYA and how the subsequent analyses can guide patient consultations. Methods: Data from the YSRCCYP were imported into a Vantage 6 node, from which FL analyses are performed along with data provided by other consortium members. The results are extracted into the PROMPT software and integrated into patient electronic healthcare records. Results: Healthcare professionals can view the results of individual PROs at various time points and in comparison to summary analyses carried out within the STRONG AYA infrastructure. Results can be filtered by age, disease, country and stage. Conclusion: We have demonstrated how a regional PBCR can contribute to a pan-European infrastructure and how the resulting analyses can be viewed to enhance patient consultations. Such analyses have the potential to be used for research and policy-making, improving outcomes for AYA.
Mayala, S.; Mzurikwao, D.; Suluba, E.
Deep learning model classification on large datasets is often limited in countries with restricted computational resources. While transfer learning can offset these limitations, standard architectures often maintain a high memory footprint. This study introduces HybridNet-XR, a memory-efficient and computationally lightweight hybrid convolutional neural network (CNN) designed to bridge the domain gap in medical radiography using autonomous self-supervised learning protocols. The HybridNet-XR architecture integrates depthwise separable convolutions for parameter reduction, residual connections for gradient stability, and aggressive early downsampling to minimize the video RAM (VRAM) footprint. We evaluated several training paradigms, including teacher-free self-supervised learning (SSL-SimCLR), teacher-led knowledge distillation (KD), and domain-gap (DG) adaptation. Each variant was pre-trained on ImageNet-1k subsets and fine-tuned on the ChestX6 multi-class dataset. Model interpretability was validated through gradient-weighted class activation mapping (Grad-CAM). The performance frontier analysis identified the HybridNet-XR-150-PW (Pre-warmed) as the optimal configuration, achieving a 93.38% average accuracy and 99% AUC while utilizing only 814.80 MB of VRAM. Regarding class-wise accuracy, this variant significantly outperformed standard MobileNetV2 and teacher-led models in critical diagnostic categories, notably Covid-19 (97.98%) and Emphysema (96.80%). Grad-CAM visualizations confirmed that the teacher-free pre-warming phase allows the model to develop sharper, anatomically grounded focus on pathological landmarks compared to distilled models. Specialized pre-warming schedules offer a viable, computationally autonomous alternative to knowledge distillation for medical imaging. 
By eliminating the requirement for high-performance teacher models, HybridNet-XR provides a robust and trustworthy diagnostic foundation suitable for clinical deployment in resource-constrained environments. Author summary: Traditional deep learning models for medical imaging are often too large for the low-power computers available in many global health settings. We developed a new model to bridge this computational gap. We designed HybridNet-XR, a highly efficient AI architecture, and trained it using a "teacher-free" method that doesn't require a massive supercomputer. We found a specific version (H-XR150-PW) that provides high accuracy while using very little memory. Our results show that high-performance diagnostic AI can be deployed on standard, low-cost hardware. Furthermore, using visual heatmaps (Grad-CAM), we proved that the AI correctly identifies medical landmarks like lung opacities, ensuring it is safe and reliable for real-world clinical use.
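The parameter reduction that the abstract attributes to depthwise separable convolutions can be checked with simple arithmetic. The sketch below compares weight counts for one layer, ignoring bias terms. The layer sizes are illustrative assumptions, not values taken from HybridNet-XR.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard 2D convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1x1 conv that mixes channels
    return depthwise + pointwise


# Assumed example layer: 3x3 kernel, 64 input channels, 128 output channels.
k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep, round(std / sep, 2))
```

For a fixed kernel size, the saving grows with the number of output channels, because the k x k spatial filters are no longer replicated for every output channel.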
Tartaglia, J.; Giorgioni, M.; Cattivelli, L.; Faccioli, P.
Background: Advances in high-throughput DNA sequencing technologies have dramatically reduced the time and cost required to generate genomic data. As sequencing is no longer a limiting factor, increasing attention must be paid to optimizing the analyses of the large-scale datasets produced. Efficient processing of such data is essential to reduce computational time and operational costs. In this context, workflow management systems (WMSs) have become key instruments for orchestrating complex bioinformatic pipelines. Among these systems, Nextflow has emerged as one of the most widely adopted solutions in bioinformatics. Methods: To improve scalability and computational efficiency, we employed Nextflow to redesign an existing pipeline dedicated to the analysis of MNase-defined cistrome-Occupancy (MOA-seq) data. The re-engineering process focused on modularizing the workflow and integrating containerization technologies to ensure reproducibility and easier deployment across heterogeneous computing environments. Results: The resulting workflow, named MOAflow, is a modernized and fully containerized pipeline for MOA-seq data analysis. With only Docker and Nextflow required, the pipeline guarantees high portability and reproducibility. Data from the original article were used to benchmark the new pipeline. Its outputs closely match those of the original study, with minor variations. Conclusions: MOAflow demonstrates how the adoption of a robust WMS can substantially enhance the performance and usability of pre-existing bioinformatic pipelines. By leveraging containerization and Nextflow, it ensures consistent results across platforms while minimizing setup complexity. This work highlights the value of modern WMS-driven approaches in meeting the computational demands.
Li, P.; Yu, Y.; Feng, J.; Huang, S.; Zhang, J.
Sepsis can lead to acute respiratory distress syndrome (ARDS) and is associated with a high mortality rate. This study investigated cellular senescence-related genes in sepsis and sepsis-induced ARDS to identify novel biomarkers. Using bioinformatics analyses including WGCNA and machine learning on public datasets, six hub genes (NFIL3, GARS, PIGM, DHRS4L2, CLIP4, LY86) were identified. These genes showed strong diagnostic value and were associated with immune cell infiltration and key pathways. Validation in lipopolysaccharide (LPS)-stimulated neutrophils showed significant upregulation of NFIL3. The findings highlight the role of cellular senescence in pathogenesis and identify promising therapeutic targets for sepsis-induced ARDS.
Abdelhamid, A.; Saad, E.
Background: Interferon-gamma (IFN-γ) is the primary effector cytokine of adaptive anti-tumor immunity, yet it paradoxically induces a potent immunosuppressive tumor microenvironment (TME). The full mechanistic scope of this paradox in head and neck squamous cell carcinoma (HNSC) has not been characterized at the transcriptomic scale. Methods: Using TCGA HNSC RNA-seq data (n = 522), we applied an integrated computational pipeline: Spearman correlation analysis, principal component analysis (PCA), UMAP, K-means clustering (k = 4), Random Forest regression, deep neural networks, permutation importance, JAK-STAT cascade mapping, and DNN-based transcriptome-wide mediation analysis across 57 IFN-γ pathway and 78 immunosuppressive genes. Results: IFN-γ pathway activity was universally and positively correlated with six immunosuppressive axes, including checkpoints (CD274; LAG3; IDO1), Tregs, myeloid suppression, and tryptophan catabolism. K-means clustering identified four immunologically distinct tumor subgroups. DNN models predicted suppressive TME. Permutation importance identified IRF8 as the dominant mediator linking IFN-γ signaling to immunosuppression. DNN mediation analysis identified PDCD1LG2 (PD-L2) as the strongest intermediary between IFNG and PD-L1 regulation, followed by JAK2 and GBP5. Conclusions: IFN-γ orchestrates coordinated immunosuppression in HNSC through JAK-STAT-IRF8 signaling. PDCD1LG2 and JAK2 are actionable mediators of this paradox, supporting combination strategies co-targeting IFN-γ-induced checkpoint induction and direct checkpoint blockade in HNSC immunotherapy.
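The first step of the pipeline named above, Spearman correlation, can be sketched in plain Python as a Pearson correlation computed on average ranks. The expression values are invented for illustration and are not TCGA data, and the authors' actual implementation is not given in the abstract.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-tumor expression values for IFNG and CD274.
ifng = [0.4, 1.2, 2.9, 3.1, 5.0, 5.0]
cd274 = [0.9, 1.1, 2.0, 2.6, 4.8, 4.1]
print(round(spearman(ifng, cd274), 3))
```

Because it works on ranks, the statistic captures monotonic rather than strictly linear association, which suits skewed expression data.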
Trummer, N.; Weyrich, M.; Ryan, P.; Furth, P. A.; Hoffmann, M.; List, M.
Anti-hormonal therapies such as selective estrogen receptor modulators like tamoxifen or aromatase inhibitors like letrozole represent a cornerstone for breast cancer prevention and therapy of estrogen receptor-positive breast cancer. Therapeutic monitoring can include blood tests and imaging; however, genetically-based approaches are not yet in practice. Ideally, a test would be able to detect a positive molecular response across different estrogen pathway-suppressive approaches. Circular RNAs are a species of non-coding RNAs detectable in plasma that have been proposed as non-invasive therapeutic biomarkers. To determine whether a set of specific circular RNAs is altered across estrogen-suppressive pathway approaches, we analyzed mammary gland-specific total RNA sequencing data from two individual genetically engineered mouse models (GEMMs) of estrogen pathway-induced breast cancer, with or without exposure to tamoxifen or letrozole. The nf-core/circrna pipeline was used to identify circRNAs that were differentially expressed in response to either tamoxifen or letrozole. We then screened for circRNAs that were differentially regulated by both anti-hormonals. Four up-regulated and 31 down-regulated circRNAs with host genes known to be expressed in human breast epithelial cells were identified as showing reproducible differential regulation in response to anti-hormonal treatment.
Iftehimul, M.; Saha, D.
Extrachromosomal DNA (ecDNA) has emerged as a critical mediator of oncogene amplification and transcriptional dynamics in aggressive cancers, yet its contribution to chemotherapy resistance in vivo remains incompletely understood. This study investigates the contribution of ecDNA-associated molecular features to predicted chemotherapy resistance in triple-negative breast cancer (TNBC). We analyzed RNA-seq data from 4T1 TNBC cells and 4T1 bulk tumors at different growth stages (1, 3, and 6 weeks) to identify differentially expressed ecDNA alterations. We then utilized molecular docking tools to predict ecDNA protein-drug interactions and employed machine learning (ML) models to predict ecDNA-associated therapeutic resistance. Our results revealed changes in global gene expression, including expression of ecDNA-associated genes, that continued over time, with significant molecular remodeling observed at six weeks. Additionally, we found gradual accumulation of mutations in ecDNA genes, which may have contributed to reduced drug binding affinity, indicating potential resistance. ML models generated stable, high-confidence classifications of resistant phenotypes, consistently identifying ecDNA burden and prevalence as dominant predictive features of drug resistance. Drug-specific predictions further highlighted elevated resistance probabilities for paclitaxel and doxorubicin, whereas hydroxyurea, which depletes ecDNA, showed reduced resistance probabilities, indicating potential roles of ecDNA in chemoresistance. This study provides new insights into the temporal remodeling of ecDNA within TNBC tumors and its potential association with drug resistance.